
fix(fetch): block autonomous fetching when robots.txt returns 5xx#3547

Open
ElliotJLT wants to merge 2 commits into modelcontextprotocol:main from ElliotJLT:fix/fetch-robots-txt-5xx-handling

Conversation

@ElliotJLT

Summary

  • When robots.txt returns a 5xx server error, the fetch server currently parses the error page body as robots.txt content, finds no Disallow rules, and silently allows autonomous crawling
  • Per RFC 9309 Section 2.3.1.3, server errors mean crawlers should assume the site is fully restricted
  • This fix adds a 5xx check that blocks autonomous fetching with a clear message, consistent with the existing 401/403 handling

The bug

robots.txt returns 200 → parsed normally ✓
robots.txt returns 401/403 → blocked ✓
robots.txt returns 404 → allowed (no restrictions) ✓
robots.txt returns 500/503 → error page parsed as robots.txt → "no rules found" → allowed ✗

A temporarily-down robots.txt becomes a free pass to autonomously fetch any URL on that site.
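The decision table above can be sketched in standalone Python. This is illustrative only: `may_autonomously_fetch` and `RobotsForbiddenError` are hypothetical stand-ins for the server's internals and `McpError`, and the real server fetches robots.txt over HTTP rather than receiving a status code and body directly.

```python
from urllib.robotparser import RobotFileParser


class RobotsForbiddenError(Exception):
    """Stand-in for McpError: autonomous fetching is disallowed."""


def may_autonomously_fetch(status_code: int, body: str, url: str) -> bool:
    """Decide whether a URL may be fetched autonomously, per RFC 9309.

    401/403 -> site restricts crawlers; 404 -> no restrictions;
    5xx -> assume the site is fully restricted; otherwise parse the body.
    """
    if status_code in (401, 403):
        raise RobotsForbiddenError(
            f"robots.txt returned {status_code}; the site forbids autonomous fetching"
        )
    if status_code == 404:
        return True  # no robots.txt means no restrictions
    if 500 <= status_code < 600:
        # The fix: a server error means we must assume the site is fully
        # restricted, rather than parsing the error page as robots.txt
        # (which finds no Disallow rules and allows everything).
        raise RobotsForbiddenError(
            f"robots.txt returned server error {status_code}; assuming the "
            "site is fully restricted; try the manual fetch prompt instead"
        )
    parser = RobotFileParser()
    parser.parse(body.splitlines())
    return parser.can_fetch("TestBot", url)
```

Without the 5xx branch, a 503 error page would fall through to the parser and return `True`, which is exactly the bug.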

Test plan

  • All 22 tests pass (20 existing + 2 new)
  • New tests verify 500 and 503 responses raise McpError
  • Existing tests confirm no regressions (401, 403, 404, 200 all unchanged)
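The shape of the new and regression cases can be sketched as plain test functions. `robots_allows` here is a hypothetical stand-in for the server's status handling, and `PermissionError` stands in for `McpError`; the actual tests in this PR exercise the real fetch server.

```python
def robots_allows(status: int) -> bool:
    """Illustrative stand-in for the server's robots.txt status handling."""
    if status in (401, 403) or 500 <= status < 600:
        raise PermissionError(f"robots.txt returned {status}")
    return True


def test_server_errors_block_autonomous_fetching():
    # The two new cases: 500 and 503 must raise rather than allow.
    for status in (500, 503):
        try:
            robots_allows(status)
        except PermissionError:
            continue
        raise AssertionError(f"{status} should have been blocked")


def test_existing_statuses_unchanged():
    # Regression coverage: 200/404 still allow, 401/403 still block.
    for status in (200, 404):
        assert robots_allows(status)
    for status in (401, 403):
        try:
            robots_allows(status)
        except PermissionError:
            continue
        raise AssertionError(f"{status} should have been blocked")
```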

🤖 Generated with Claude Code

ElliotJLT and others added 2 commits March 13, 2026 11:23
…te_branch, git_log, and git_branch

git_diff and git_checkout already reject user-supplied values starting
with '-' to prevent flag injection (even when a malicious ref exists via
filesystem manipulation). The same defense-in-depth pattern was missing
from four other functions:

- git_show: revision parameter passed directly to repo.commit()
- git_create_branch: branch_name and base_branch unchecked
- git_log: start_timestamp and end_timestamp passed to --since/--until
- git_branch: contains and not_contains passed as raw args to repo.git.branch()

Adds the same startswith("-") guards with matching tests for each function.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Per RFC 9309 Section 2.3.1.3, when a robots.txt fetch results in a
server error (5xx), crawlers should assume the site is fully restricted.

Previously, 5xx responses fell through to the robots.txt parser, which
would parse the error page HTML body, find no Disallow rules, and
silently allow crawling. This meant a temporarily-down robots.txt
became a free pass to autonomously fetch any URL on that site.

Now 5xx responses raise McpError with a clear message pointing users
to the manual fetch prompt, consistent with the 401/403 handling.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
